
feat: add stringAgg aggregate function#1382

Open
KyleAMathews wants to merge 3 commits into main from string-agg

Conversation

@KyleAMathews
Collaborator

Summary

Adds a stringAgg aggregate function that concatenates string values within groups, with configurable separators and ordering. Available across the full stack: query builder, compiler, and IVM engine.

Approach

Two-tier IVM architecture:

  • Stateful incremental path (used by the query compiler): Maintains a sorted array of entries per group with O(log n) binary search for insertion/removal. Fast-path string slicing for head/tail changes avoids full rebuilds — only middle-position mutations trigger a rebuild from the ordered entries.
  • Stateless fallback path (no rowKeyExtractor): Simple sort-and-join on each update. Used when row identity isn't available.
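The sorted-array maintenance in the incremental path can be sketched as follows. This is a simplified illustration, not the actual `groupBy.ts` code; the names `StringAggEntry`, `binaryInsertIndex`, and `insertEntry` are hypothetical:

```typescript
// Hypothetical sketch of the incremental path's sorted-array maintenance.
// These are illustrative names, not the actual exports of
// packages/db-ivm/src/operators/groupBy.ts.

type StringAggEntry = { rowKey: string; value: string; orderValue: number }

// O(log n) binary search for the insertion point that keeps the array
// sorted by orderValue, with ties broken by rowKey for determinism.
function binaryInsertIndex(
  entries: Array<StringAggEntry>,
  entry: StringAggEntry
): number {
  let lo = 0
  let hi = entries.length
  while (lo < hi) {
    const mid = (lo + hi) >>> 1
    const e = entries[mid]
    const cmp =
      e.orderValue !== entry.orderValue
        ? e.orderValue - entry.orderValue
        : e.rowKey < entry.rowKey ? -1 : e.rowKey > entry.rowKey ? 1 : 0
    if (cmp < 0) lo = mid + 1
    else hi = mid
  }
  return lo
}

function insertEntry(
  entries: Array<StringAggEntry>,
  entry: StringAggEntry
): number {
  const index = binaryInsertIndex(entries, entry)
  entries.splice(index, 0, entry) // note: splice itself is O(n)
  return index
}
```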

Query builder overloads disambiguate the flexible API surface:

stringAgg(value)                        // default order, no separator
stringAgg(value, orderBy)               // ordered, no separator  
stringAgg(value, separator)             // separator, default order
stringAgg(value, separator, orderBy)    // both

The compiler distinguishes separator (literal string) from orderBy (column reference) by expression type, and always provides a rowKeyExtractor to enable the incremental path.
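The type-based disambiguation can be sketched in miniature. The `Expr` union below is an assumed shape for illustration, not the real builder IR, and `splitStringAggArgs` is a hypothetical helper:

```typescript
// Hypothetical sketch of splitting the second argument of
// stringAgg(value, separatorOrOrderBy?) into a separator vs. an orderBy.
// The `Expr` union is an assumption, not the real IR types.

type Expr =
  | { type: "val"; value: unknown }       // literal
  | { type: "ref"; path: Array<string> }  // column reference

function splitStringAggArgs(arg?: Expr): { separator: string; orderBy?: Expr } {
  // A literal string is a separator; anything else (e.g. a ref) is an orderBy.
  if (arg?.type === "val" && typeof arg.value === "string") {
    return { separator: arg.value }
  }
  return { separator: "", orderBy: arg }
}
```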

Key invariants

  • entriesByKey and orderedEntries must stay synchronized — removeStringAggEntry now throws on desynchronization rather than silently returning
  • Fast-path text splicing is only used for first/last position changes; middle mutations set textDirty for a full rebuild
  • Null/undefined values are excluded from concatenation (consistent with SQL string_agg semantics)

Non-goals

  • Custom comparators for ordering (uses built-in comparison with Date normalization)
  • Streaming/lazy concatenation for very large groups

Verification

pnpm vitest run packages/db-ivm/tests/operators/groupBy.test.ts packages/db/tests/query/builder/functions.test.ts packages/db/tests/query/group-by.test.ts packages/db/tests/query/group-by.test-d.ts

All 140 tests pass, including new coverage for:

  • Incremental inserts/removes/updates with ordering
  • Group deletion + re-creation (cleanup path)
  • Fallback path (no rowKeyExtractor)
  • Values containing the separator string
  • Builder overload disambiguation
  • Type-level tests for return types

Files changed

File Change
packages/db-ivm/src/operators/groupBy.ts Core stringAgg implementation with binary search, incremental text maintenance, and fallback path
packages/db-ivm/src/operators/reduce.ts Pass group key to reduction function
packages/db/src/query/builder/functions.ts stringAgg builder with 4 overloads and OrderByLike type
packages/db/src/query/compiler/group-by.ts Compiler case mapping builder IR to IVM stringAgg
packages/db/src/query/index.ts Re-export stringAgg
docs/guides/live-queries.md Usage examples and API reference
Test files (4) Comprehensive coverage across IVM, builder, compiler, and type levels

🤖 Generated with Claude Code

KyleAMathews and others added 3 commits March 17, 2026 17:05
Adds a new stringAgg aggregate function that concatenates string values
within groups with configurable separator and ordering. Uses a two-tier
architecture: a stateful incremental path with O(log n) binary search
for efficient delta updates, and a stateless fallback for simpler use
cases. Includes query builder overloads, compiler integration, docs,
and comprehensive test coverage.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@pkg-pr-new

pkg-pr-new bot commented Mar 17, 2026

More templates

@tanstack/angular-db

npm i https://pkg.pr.new/TanStack/db/@tanstack/angular-db@1382

@tanstack/db

npm i https://pkg.pr.new/TanStack/db/@tanstack/db@1382

@tanstack/db-browser-wa-sqlite-persisted-collection

npm i https://pkg.pr.new/TanStack/db/@tanstack/db-browser-wa-sqlite-persisted-collection@1382

@tanstack/db-ivm

npm i https://pkg.pr.new/TanStack/db/@tanstack/db-ivm@1382

@tanstack/db-react-native-sqlite-persisted-collection

npm i https://pkg.pr.new/TanStack/db/@tanstack/db-react-native-sqlite-persisted-collection@1382

@tanstack/db-sqlite-persisted-collection-core

npm i https://pkg.pr.new/TanStack/db/@tanstack/db-sqlite-persisted-collection-core@1382

@tanstack/electric-db-collection

npm i https://pkg.pr.new/TanStack/db/@tanstack/electric-db-collection@1382

@tanstack/offline-transactions

npm i https://pkg.pr.new/TanStack/db/@tanstack/offline-transactions@1382

@tanstack/powersync-db-collection

npm i https://pkg.pr.new/TanStack/db/@tanstack/powersync-db-collection@1382

@tanstack/query-db-collection

npm i https://pkg.pr.new/TanStack/db/@tanstack/query-db-collection@1382

@tanstack/react-db

npm i https://pkg.pr.new/TanStack/db/@tanstack/react-db@1382

@tanstack/rxdb-db-collection

npm i https://pkg.pr.new/TanStack/db/@tanstack/rxdb-db-collection@1382

@tanstack/solid-db

npm i https://pkg.pr.new/TanStack/db/@tanstack/solid-db@1382

@tanstack/svelte-db

npm i https://pkg.pr.new/TanStack/db/@tanstack/svelte-db@1382

@tanstack/trailbase-db-collection

npm i https://pkg.pr.new/TanStack/db/@tanstack/trailbase-db-collection@1382

@tanstack/vue-db

npm i https://pkg.pr.new/TanStack/db/@tanstack/vue-db@1382

commit: 4888335

@github-actions
Contributor

Size Change: +272 B (+0.25%)

Total Size: 110 kB

Filename Size Change
./packages/db/dist/esm/index.js 2.88 kB +20 B (+0.7%)
./packages/db/dist/esm/query/builder/functions.js 872 B +80 B (+10.1%) ⚠️
./packages/db/dist/esm/query/compiler/group-by.js 2.86 kB +172 B (+6.39%) 🔍
Unchanged files:
Filename Size
./packages/db/dist/esm/collection/change-events.js 1.39 kB
./packages/db/dist/esm/collection/changes.js 1.38 kB
./packages/db/dist/esm/collection/cleanup-queue.js 810 B
./packages/db/dist/esm/collection/events.js 434 B
./packages/db/dist/esm/collection/index.js 3.69 kB
./packages/db/dist/esm/collection/indexes.js 2.35 kB
./packages/db/dist/esm/collection/lifecycle.js 1.76 kB
./packages/db/dist/esm/collection/mutations.js 2.47 kB
./packages/db/dist/esm/collection/state.js 5.2 kB
./packages/db/dist/esm/collection/subscription.js 3.71 kB
./packages/db/dist/esm/collection/sync.js 2.43 kB
./packages/db/dist/esm/collection/transaction-metadata.js 144 B
./packages/db/dist/esm/deferred.js 207 B
./packages/db/dist/esm/errors.js 4.83 kB
./packages/db/dist/esm/event-emitter.js 748 B
./packages/db/dist/esm/indexes/auto-index.js 777 B
./packages/db/dist/esm/indexes/base-index.js 766 B
./packages/db/dist/esm/indexes/btree-index.js 2.17 kB
./packages/db/dist/esm/indexes/lazy-index.js 1.24 kB
./packages/db/dist/esm/indexes/reverse-index.js 538 B
./packages/db/dist/esm/local-only.js 890 B
./packages/db/dist/esm/local-storage.js 2.1 kB
./packages/db/dist/esm/optimistic-action.js 359 B
./packages/db/dist/esm/paced-mutations.js 496 B
./packages/db/dist/esm/proxy.js 3.75 kB
./packages/db/dist/esm/query/builder/index.js 5.15 kB
./packages/db/dist/esm/query/builder/ref-proxy.js 1.05 kB
./packages/db/dist/esm/query/compiler/evaluators.js 1.62 kB
./packages/db/dist/esm/query/compiler/expressions.js 430 B
./packages/db/dist/esm/query/compiler/index.js 3.62 kB
./packages/db/dist/esm/query/compiler/joins.js 2.11 kB
./packages/db/dist/esm/query/compiler/order-by.js 1.5 kB
./packages/db/dist/esm/query/compiler/select.js 1.11 kB
./packages/db/dist/esm/query/effect.js 4.78 kB
./packages/db/dist/esm/query/expression-helpers.js 1.43 kB
./packages/db/dist/esm/query/ir.js 784 B
./packages/db/dist/esm/query/live-query-collection.js 360 B
./packages/db/dist/esm/query/live/collection-config-builder.js 7.63 kB
./packages/db/dist/esm/query/live/collection-registry.js 264 B
./packages/db/dist/esm/query/live/collection-subscriber.js 1.94 kB
./packages/db/dist/esm/query/live/internal.js 145 B
./packages/db/dist/esm/query/live/utils.js 1.57 kB
./packages/db/dist/esm/query/optimizer.js 2.62 kB
./packages/db/dist/esm/query/predicate-utils.js 2.97 kB
./packages/db/dist/esm/query/query-once.js 359 B
./packages/db/dist/esm/query/subset-dedupe.js 960 B
./packages/db/dist/esm/scheduler.js 1.3 kB
./packages/db/dist/esm/SortedMap.js 1.3 kB
./packages/db/dist/esm/strategies/debounceStrategy.js 247 B
./packages/db/dist/esm/strategies/queueStrategy.js 428 B
./packages/db/dist/esm/strategies/throttleStrategy.js 246 B
./packages/db/dist/esm/transactions.js 2.9 kB
./packages/db/dist/esm/utils.js 927 B
./packages/db/dist/esm/utils/browser-polyfills.js 304 B
./packages/db/dist/esm/utils/btree.js 5.61 kB
./packages/db/dist/esm/utils/comparison.js 1.05 kB
./packages/db/dist/esm/utils/cursor.js 457 B
./packages/db/dist/esm/utils/index-optimization.js 1.54 kB
./packages/db/dist/esm/utils/type-guards.js 157 B
./packages/db/dist/esm/virtual-props.js 360 B

compressed-size-action::db-package-size

@github-actions
Contributor

Size Change: 0 B

Total Size: 4.23 kB

Unchanged files:
Filename Size
./packages/react-db/dist/esm/index.js 249 B
./packages/react-db/dist/esm/useLiveInfiniteQuery.js 1.32 kB
./packages/react-db/dist/esm/useLiveQuery.js 1.34 kB
./packages/react-db/dist/esm/useLiveQueryEffect.js 355 B
./packages/react-db/dist/esm/useLiveSuspenseQuery.js 559 B
./packages/react-db/dist/esm/usePacedMutations.js 401 B

compressed-size-action::react-db-package-size

Contributor

@kevin-dp kevin-dp left a comment


PR #1382 Review: stringAgg aggregate function

Overall assessment


The implementation is sound for its intended use cases and the tests are thorough. But there are real architectural concerns, one potential correctness issue, and some unnecessary complexity worth discussing.

1. It's not truly incremental — the reduce function is O(n) every time

The reduce function rebuilds nextEntriesByKey by scanning all entries in the Index for the group on every invocation:

// groupBy.ts ~line 430-457
const nextEntriesByKey = new Map<...>()
for (const [entry, multiplicity] of values) {   // ← ALL accumulated values
    if (entry.rowKey == null || multiplicity <= 0 || entry.value == null) continue
    nextEntriesByKey.set(entry.rowKey, {...})
}

The "incremental" benefit is only in steps 2 and 3 — maintaining the sorted array and doing fast-path text splicing. Step 1 (building the target state) is always O(k) where k = entries in the Index for that group.

This is a limitation of the ReduceOperator contract (it always passes all accumulated values), so it's not something stringAgg can easily avoid. But the PR description's emphasis on "O(log n) binary search" and "fast-path string slicing" is somewhat misleading — those optimizations only help after the O(k) scan has already happened.

The good news: the Index does consolidate entries via content hashing (MurmurHash in packages/db-ivm/src/hashing/hash.ts). When a row is removed, the +1 and -1 entries for the same content cancel out and are deleted. So k ≈ number of active rows in the group, not historical total. This makes the O(k) scan reasonable in practice.
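The cancellation behavior described above can be shown in miniature. This sketch mimics the effect of consolidation on (value, multiplicity) pairs; the real Index keys by content hash rather than by value identity, and `consolidate` is a hypothetical name:

```typescript
// Miniature illustration of multiplicity consolidation: a +1 and a -1
// entry for the same content cancel out and are deleted, so only active
// rows survive. The real Index uses content hashing; this keys by value.

function consolidate<T>(entries: Array<[T, number]>): Map<T, number> {
  const out = new Map<T, number>()
  for (const [value, multiplicity] of entries) {
    const next = (out.get(value) ?? 0) + multiplicity
    if (next === 0) out.delete(value) // fully cancelled: entry disappears
    else out.set(value, next)
  }
  return out
}
```

This is why k in the O(k) scan tracks the number of active rows in the group rather than the historical total of inserts and removes.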

2. Correctness depends on a fragile invariant: "last positive-multiplicity entry wins"

This is my biggest concern. The code builds the target state by iterating Index entries and setting nextEntriesByKey[rowKey] for each positive-multiplicity entry, with last-write-wins:

for (const [entry, multiplicity] of values) {
    if (multiplicity <= 0 ...) continue
    nextEntriesByKey.set(entry.rowKey, {...})  // last positive wins
}

Compare this to how existing aggregates work:

  • sum: total += value * multiplicity — algebraically correct regardless of entry order, handles negative multiplicities naturally
  • count: totalCount += nullMultiplier * multiplicity — same: algebraic
  • stringAgg: skips negative multiplicities entirely — correct only if the Index properly consolidates

If the Index ever fails to consolidate (e.g., hash collision between two different pre-mapped objects, or a bug in consolidation), stringAgg would silently include ghost entries that should have been removed. The algebraic aggregates would still be correct because value * (+1) + value * (-1) = 0.

This is a real fragility difference. The hash function is 32-bit MurmurHash — collisions are rare but not impossible, and when they happen, stringAgg would be the first aggregate to produce visibly wrong results.
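The difference in failure modes can be demonstrated directly. This is an illustrative contrast, not the actual aggregate code; `algebraicSum` and `lastPositiveWins` are hypothetical names mirroring the two patterns:

```typescript
// Why algebraic aggregates tolerate unconsolidated entries: the +1/-1
// pair cancels by arithmetic, with no dependence on the Index having
// consolidated first.
function algebraicSum(entries: Array<[number, number]>): number {
  let total = 0
  for (const [value, multiplicity] of entries) {
    total += value * multiplicity // correct even with unpaired +1/-1 present
  }
  return total
}

// The last-positive-wins pattern described above: if a -1 entry ever
// failed to cancel its +1 twin (e.g. a consolidation bug), skipping
// negatives means the ghost value silently survives.
function lastPositiveWins(
  entries: Array<[string, string, number]>
): Map<string, string> {
  const byKey = new Map<string, string>()
  for (const [rowKey, value, multiplicity] of entries) {
    if (multiplicity <= 0) continue // ignored, not subtracted
    byKey.set(rowKey, value)
  }
  return byKey
}
```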

3. Array.splice() is O(n) — the binary search doesn't help as much as claimed

The sorted array maintenance uses splice for both insert and remove:

// Insert
state.orderedEntries.splice(index, 0, entry)  // O(n) array shift

// Remove  
state.orderedEntries.splice(index, 1)          // O(n) array shift

The O(log n) binary search finds the position, but the O(n) splice dominates. For a group with 10,000 entries, each insert/remove shifts thousands of array elements. A balanced BST or skip list would give true O(log n) insert/remove, but that's probably over-engineering for typical group sizes.

4. Fast-path text splicing: clever but has wasted work

The head/tail text splicing optimization is well-implemented — it correctly uses exact value lengths rather than searching for separator positions, so it handles values containing the separator string correctly (as tested).

However, when textDirty is already true (from a middle-position change), subsequent remove/insert operations still perform fast-path string modifications that will be thrown away:

// If an earlier change set textDirty = true, this string work is wasted
if (index === entryCount - 1) {
    state.text = state.text.slice(0, state.text.length - suffixLength)
    return false  // says "no rebuild needed" but textDirty is already true
}

Not a correctness issue — the rebuild at the end produces the right result. But it's unnecessary string allocation for batch updates that include a middle-position change.

5. The stateful closure pattern is novel and creates lifecycle coupling

This is the first aggregate to use closure-based external state (groupStates Map). All other aggregates (sum, count, avg, min, max, median, mode) are stateless — they compute the result from the full set of values each time.

The stateful pattern requires the new cleanup callback:

type BasicAggregateFunction<T, R, V = unknown, Reduced = V> = {
  preMap: (data: T) => V
  reduce: (values: Array<[V, number]>, groupKey: string) => Reduced
  postMap?: (result: Reduced) => R
  cleanup?: (groupKey: string) => void    // NEW: only needed by stringAgg
}

Plus the reduce function signature change to pass groupKey (also new). These are framework-level changes to support a single aggregate. The cleanup is correctly called when totalMultiplicity <= 0, which handles group deletion. But this creates a coupling between the aggregate's lifecycle and the reduce operator's — if the graph is ever recreated or the reduce operator is reset without calling cleanup, the stale groupStates entries would cause incorrect results.
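The lifecycle coupling can be sketched as follows. Names and the reduce body are illustrative, not the real operator API; the point is that the closure's `groupStates` outlives any single reduce call and must be explicitly evicted:

```typescript
// Sketch of the closure-based group state + cleanup coupling.
// `makeStatefulAggregate` and its internals are hypothetical names.

type GroupState = { text: string }

function makeStatefulAggregate() {
  const groupStates = new Map<string, GroupState>()
  return {
    reduce(values: Array<[string, number]>, groupKey: string): string {
      let state = groupStates.get(groupKey)
      if (!state) {
        state = { text: "" }
        groupStates.set(groupKey, state)
      }
      // Simplified: the real path maintains sorted entries incrementally.
      state.text = values
        .filter(([, m]) => m > 0)
        .map(([v]) => v)
        .sort()
        .join(",")
      return state.text
    },
    // Called when a group's total multiplicity drops to <= 0; without this,
    // re-creating the group would observe stale state.
    cleanup(groupKey: string): void {
      groupStates.delete(groupKey)
    },
    size(): number {
      return groupStates.size
    },
  }
}
```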

6. Minor issues in the compiler integration

In packages/db/src/query/compiler/group-by.ts, the stringagg case disambiguates separator vs orderBy by checking expression type:

const separator =
    separatorOrOrderByExpr?.type === `val` &&
    typeof separatorOrOrderByExpr.value === `string`
        ? separatorOrOrderByExpr.value
        : ``

This means stringAgg(col, "") (empty string separator) would work, but stringAgg(col, someStringColumn) would incorrectly treat someStringColumn as an orderBy expression (since it's a ref, not a val). That's actually the correct behavior per the API design — but it means you can't use a column reference as a dynamic separator. Worth documenting.

7. The compareStringAggOrderValues comparison works but has edge cases

Comparing bigint and number via < / > works in JS. Comparing boolean gives false < true. These are fine. But comparing string vs number (e.g., if a user accidentally passes mixed types) gives unpredictable results since JS coerces strings to NaN. The type system should prevent this in practice, but there's no runtime guard.
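The hazard can be made concrete with a small comparator sketch. This mirrors the behavior described above; `normalize` and `compareOrderValues` are illustrative names, not the actual `compareStringAggOrderValues`:

```typescript
// Sketch of an order-value comparator with Date normalization, plus the
// mixed-type hazard: comparing a string against a number coerces the
// string to NaN, and both < and > come back false.

function normalize(v: unknown): unknown {
  return v instanceof Date ? v.getTime() : v
}

function compareOrderValues(a: unknown, b: unknown): number {
  const x = normalize(a) as any
  const y = normalize(b) as any
  if (x < y) return -1
  if (x > y) return 1
  // Mixed string/number comparisons fall through here: both relational
  // checks are false, so unequal values report as "equal".
  return 0
}
```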

Summary

Aspect Assessment
Correctness Sound under normal conditions; fragile vs hash collisions
Incremental benefit Real but limited — O(k) scan is unavoidable, splice is O(n)
Best use case Append-only streams (LLM chunks) where tail-insert fast-path shines
Code complexity High for the benefit; the stateful pattern + cleanup adds framework-level changes
Test coverage Excellent — covers inserts, removes, updates, reordering, group deletion/recreation, separator-in-value, fallback path

The main question I have: is the stateful incremental approach worth the complexity? For the LLM streaming use case (appending to the end), the fast-path tail insert avoids an O(total_text_length) rebuild per chunk, which is genuinely valuable. But for other use cases (random inserts/removals in the middle), it degrades to a full rebuild anyway, and the O(k) scan + O(n) splice overhead means it's not dramatically faster than a simple "sort and join" on each call.

